{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "2. spaCy. Eng.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Afanasyy/colab/blob/main/2_spaCy_Eng.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 0. Библиотеки для NLP\n",
        "\n",
        "NLP - natural language processing (обработка естественного языка)\n",
        "\n",
        "## Зачем нужны библиотеки?\n",
        "\n",
        "\n",
        "*   Tokenization\n",
        "*   Part-of-speech (POS) Tagging\n",
        "*   Lemmatization\n",
        "*   Named Entity Recognition (NER)\n",
        "*   Similarity\n",
        "*   Text Classification\n",
        "*   etc.\n",
        "\n",
        "## Библиотеки для NLP\n",
        "\n",
        "\n",
        "*   NLTK: English\n",
        "*   **spaCy**: English, Spanish, Russian, etc.\n",
        "*   natasha: Russian\n",
        "*   etc.\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "so3CpK1ejqlN"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 1. spaCy: Введение\n",
        "\n",
        "https://spacy.io/\n",
        "\n",
        "Данная версия ноутбука для английских моделей spaCy."
      ],
      "metadata": {
        "id": "FfwSH_5HmUYq"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Загрузка модели и установка модели"
      ],
      "metadata": {
        "id": "va0I5GM4zV_4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Существуют следующие английские модели:\n",
        "\n",
        "\n",
        "*   en_core_web_sm\n",
        "*   en_core_web_md\n",
        "*   en_core_web_lg\n",
        "*   en_core_web_trf\n",
        "\n",
        "https://spacy.io/models/en\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "6V9c5BbQqBiD"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "В отличие от русских моделей. в Google Colab необязательно загружать специальную версию spaCy и языковую модель en_core_web_sm."
      ],
      "metadata": {
        "id": "HPu8TT4-tNy9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import spacy\n",
        "nlp = spacy.load(\"en_core_web_sm\")"
      ],
      "metadata": {
        "id": "pAhYgCpzwIQW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "nlp"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FSV3iUgByzO6",
        "outputId": "217f3dad-059f-47c3-90dd-2f82783e2c70"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<spacy.lang.en.English at 0x7fc29d2f4a50>"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "В переменной nlp находится объект Language, у которого есть несколько встроенных функций."
      ],
      "metadata": {
        "id": "3vF7lZ3OzI5p"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Первые шаги"
      ],
      "metadata": {
        "id": "Sv6ipQLUzcBX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = 'Sharks, among the fiercest predators in the ocean, are also some of the most vulnerable. '\n",
        "doc = nlp(text)"
      ],
      "metadata": {
        "id": "WDa3s7pszH1e"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "type(doc)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9klENXn66PSl",
        "outputId": "e19f323e-ad7f-4039-a58d-3d913c3669f6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "spacy.tokens.doc.Doc"
            ]
          },
          "metadata": {},
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Что такое переменная **doc**? Это объект типа Doc, т.е. набор токенов, который содержит как исходный текст, так и всё, что обработала библиотека spaCy: леммы, именованные сущности, векторы слов и проч."
      ],
      "metadata": {
        "id": "es077NhH5il_"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Токенизация"
      ],
      "metadata": {
        "id": "3K8IfthbzhQI"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Токенизация** - процесс разделения письменного языка на предложения-компоненты (слова и не-слова, типа знаков препинания) - т.е. на **токены**. "
      ],
      "metadata": {
        "id": "eTlzNjSD7C9E"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for token in doc:\n",
        "  print(token.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PBcvSc4X7Wdh",
        "outputId": "a99a4cd5-c55b-4d86-f536-8f51a784a463"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Sharks\n",
            ",\n",
            "among\n",
            "the\n",
            "fiercest\n",
            "predators\n",
            "in\n",
            "the\n",
            "ocean\n",
            ",\n",
            "are\n",
            "also\n",
            "some\n",
            "of\n",
            "the\n",
            "most\n",
            "vulnerable\n",
            ".\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Выведем общее количество токенов в тексте:"
      ],
      "metadata": {
        "id": "2zoUuv698Zyn"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"Всего в тексте {} токенов\".format(len(doc)))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "9NSfXGtG8c9f",
        "outputId": "7f15d17b-72ed-4e0d-b2f6-62c1faec93b9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Всего в тексте 18 токенов\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Очистим текст от знаков препинания и стоп-слов (незначительных слов, которые часто встречаются в тексте)"
      ],
      "metadata": {
        "id": "YZbn0a-H_XpC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clean_doc = []\n",
        "for token in doc:\n",
        "  if not token.is_stop and not token.is_punct:\n",
        "    clean_doc.append(token)\n",
        "print(clean_doc)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wXxhTvp1_fSv",
        "outputId": "9bc1bebf-de29-4af7-893a-bd8d933cedf7"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[Sharks, fiercest, predators, ocean, vulnerable]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "#альтернативная запись кода сверху\n",
        "# clean_doc=[token for token in doc if not token.is_stop and not token.is_punct]"
      ],
      "metadata": {
        "id": "dGyb2XYIASzH"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"В очищенном тексте {} токенов\".format(len(clean_doc)))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0gxcIEMU_uvp",
        "outputId": "60805eec-f20a-4ddd-9354-1b6146878b1d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "В очищенном тексте 5 токенов\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Стоит отметить, что каждый токен, находящийся в полученном очищенном массиве, является уникальным, так как содержит свою индивидуальную информацию."
      ],
      "metadata": {
        "id": "zTYyEhYBADBw"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Лемматизация"
      ],
      "metadata": {
        "id": "BVQNaFVb7jmR"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Лемматизация** - приведение слов к начальной форме (отдельные слова - **леммы**)."
      ],
      "metadata": {
        "id": "po9Gwvto7v0r"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for token in doc:\n",
        "  print(token.lemma_)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Om2ElQwn7Zgv",
        "outputId": "76239545-08b8-4846-8acc-355ce83301cf"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Sharks\n",
            ",\n",
            "among\n",
            "the\n",
            "fierce\n",
            "predator\n",
            "in\n",
            "the\n",
            "ocean\n",
            ",\n",
            "be\n",
            "also\n",
            "some\n",
            "of\n",
            "the\n",
            "most\n",
            "vulnerable\n",
            ".\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Вывод частей речи"
      ],
      "metadata": {
        "id": "m8sIoa_F8wB3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for token in doc:\n",
        "  print(token.text,'---- ',token.pos_)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cNyLX4sQ8zH2",
        "outputId": "41ce0ab8-4069-4e99-bb01-5baecbefa20f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Sharks ----  PROPN\n",
            ", ----  PUNCT\n",
            "among ----  ADP\n",
            "the ----  DET\n",
            "fiercest ----  ADJ\n",
            "predators ----  NOUN\n",
            "in ----  ADP\n",
            "the ----  DET\n",
            "ocean ----  NOUN\n",
            ", ----  PUNCT\n",
            "are ----  AUX\n",
            "also ----  ADV\n",
            "some ----  DET\n",
            "of ----  ADP\n",
            "the ----  DET\n",
            "most ----  ADV\n",
            "vulnerable ----  ADJ\n",
            ". ----  PUNCT\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Для определения обозначений можно использовать этот сайт: https://universaldependencies.org/u/pos/\n",
        "Также можно попросить spaCy объяснить, что значит то или иное обозначение:"
      ],
      "metadata": {
        "id": "Z6aX28LI9hgd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "spacy.explain('ADP')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "HMvGHN219jpu",
        "outputId": "90110522-a2ec-40ae-9cf7-1271008f77b5"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'adposition'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Выделение именованных сущностей (NER)"
      ],
      "metadata": {
        "id": "Invqm6-HB7qo"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Именованные сущности** - имена людей, названия компаний и проч."
      ],
      "metadata": {
        "id": "3tKNXnViCskM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = \"In Cerro Punta, Panama, a hummingbird pollinates an orchid.\"\n",
        "doc2 = nlp(text)\n",
        "print(doc2.ents)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iredH87PC0aO",
        "outputId": "3ad7f474-facc-4044-c6a4-9e0e61350937"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(Cerro Punta, Panama, hummingbird)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Также можно посмотреть, к какому классу относятся выделенные именованные сущности."
      ],
      "metadata": {
        "id": "AwWylUDRGEZW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for entity in doc2.ents:\n",
        "  print(entity.text,'--- ',entity.label_)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mt_-cDKqGKvC",
        "outputId": "10b775a1-3419-4b9a-862b-8a7b229f632a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Cerro Punta ---  GPE\n",
            "Panama ---  GPE\n",
            "hummingbird ---  ORDINAL\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Чтобы выделить именованные сущности в тексте, существует дополнение displacy"
      ],
      "metadata": {
        "id": "PYz-uoajGPyF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from spacy import displacy\n",
        "displacy.render(doc2, style='ent', jupyter=True)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 52
        },
        "id": "TGXbXTW7GbG2",
        "outputId": "3c27e82e-dfac-43f5-b784-271d16e00a75"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">In \n",
              "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
              "    Cerro Punta\n",
              "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
              "</mark>\n",
              ", \n",
              "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
              "    Panama\n",
              "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
              "</mark>\n",
              ", a \n",
              "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
              "    hummingbird\n",
              "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORDINAL</span>\n",
              "</mark>\n",
              " pollinates an orchid.</div></span>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Визуализация деревьев зависимостей"
      ],
      "metadata": {
        "id": "_Emqmg5TipG9"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "При помощи spaCy можно также визуализировать дерево зависимостей, где показаны части речи и части предложения, а также отношение зависимостей."
      ],
      "metadata": {
        "id": "giKjujeiiuuD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from spacy import displacy\n",
        "text= \"In Cerro Punta, Panama, a hummingbird pollinates an orchid.\"\n",
        "doc3 = nlp(text)\n",
        "\n",
        "displacy.render(doc3,style='dep',jupyter=True)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 421
        },
        "id": "kJq5qdmBjF70",
        "outputId": "02b0f889-4cb2-499c-d392-42d73a710981"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"39a3d5886490484f8ddad07a5c22b429-0\" class=\"displacy\" width=\"1625\" height=\"399.5\" direction=\"ltr\" style=\"max-width: none; height: 399.5px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">In</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">ADP</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"225\">Cerro</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"225\">PROPN</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"400\">Punta,</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"400\">PROPN</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"575\">Panama,</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"575\">PROPN</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"750\">a</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"750\">DET</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"925\">hummingbird</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"925\">NOUN</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">pollinates</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">VERB</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1275\">an</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1275\">DET</tspan>\n",
              "</text>\n",
              "\n",
              "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"309.5\">\n",
              "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1450\">orchid.</tspan>\n",
              "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1450\">NOUN</tspan>\n",
              "</text>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-0\" stroke-width=\"2px\" d=\"M70,264.5 C70,2.0 1100.0,2.0 1100.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M70,266.5 L62,254.5 78,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-1\" stroke-width=\"2px\" d=\"M245,264.5 C245,177.0 390.0,177.0 390.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M245,266.5 L237,254.5 253,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-2\" stroke-width=\"2px\" d=\"M70,264.5 C70,89.5 395.0,89.5 395.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M395.0,266.5 L403.0,254.5 387.0,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-3\" stroke-width=\"2px\" d=\"M420,264.5 C420,177.0 565.0,177.0 565.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">appos</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M565.0,266.5 L573.0,254.5 557.0,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-4\" stroke-width=\"2px\" d=\"M770,264.5 C770,177.0 915.0,177.0 915.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M770,266.5 L762,254.5 778,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-5\" stroke-width=\"2px\" d=\"M945,264.5 C945,177.0 1090.0,177.0 1090.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M945,266.5 L937,254.5 953,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-6\" stroke-width=\"2px\" d=\"M1295,264.5 C1295,177.0 1440.0,177.0 1440.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M1295,266.5 L1287,254.5 1303,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "\n",
              "<g class=\"displacy-arrow\">\n",
              "    <path class=\"displacy-arc\" id=\"arrow-39a3d5886490484f8ddad07a5c22b429-0-7\" stroke-width=\"2px\" d=\"M1120,264.5 C1120,89.5 1445.0,89.5 1445.0,264.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
              "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
              "        <textPath xlink:href=\"#arrow-39a3d5886490484f8ddad07a5c22b429-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
              "    </text>\n",
              "    <path class=\"displacy-arrowhead\" d=\"M1445.0,266.5 L1453.0,254.5 1437.0,254.5\" fill=\"currentColor\"/>\n",
              "</g>\n",
              "</svg></span>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Описание синтаксических отношений между членами предложения можно посмотреть по ссылке: https://universaldependencies.org/en/dep/"
      ],
      "metadata": {
        "id": "4le7KyI2jhA8"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Распознавание эл. почты"
      ],
      "metadata": {
        "id": "xRQYTi4mCA3F"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "С помощью spaCy также можно распознавать адреса электронной почты:"
      ],
      "metadata": {
        "id": "zaeRcDCrCl4N"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = \"Send your works to myname@hse.ru\"\n",
        "doc4 = nlp(text)\n",
        "for token in doc4:\n",
        "  if token.like_email:\n",
        "    print(token.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kcD2f3TCAhw_",
        "outputId": "bfcb0b78-2e1c-48f1-926a-d84e17b853b2"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "myname@hse.ru\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 2. spaCy для решения некоторых задач."
      ],
      "metadata": {
        "id": "KNHg0Orvb-Or"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Анонимизация/Маскирование"
      ],
      "metadata": {
        "id": "KtGIghmfojTi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = \"Tony Stark owns the company StarkEnterprises. Emily Clark works at Microsoft and lives in Manchester.\"\n",
        "doc = nlp(text)"
      ],
      "metadata": {
        "id": "lcSIi1bmoc3V"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "def update_article(text):\n",
        "  doc = nlp(text)\n",
        "  for word in doc:\n",
        "      if word.ent_type_ =='PERSON' or word.ent_type_=='ORG' or word.ent_type_=='GPE' or word.ent_type_=='LOC':\n",
        "        text = text.replace(word.text, 'UNKNOWN')\n",
        "  return text"
      ],
      "metadata": {
        "id": "_UGhsmv57BFS"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "В отличие от модели русского языка, имена людей обозначаются тегом PERSON, а не PER."
      ],
      "metadata": {
        "id": "ISjpwB6Lrp6x"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "update_article(text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "jQudZAtH7dpz",
        "outputId": "d3406c58-4e28-45e8-f5d1-7a1112da360a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'UNKNOWN UNKNOWN owns the company UNKNOWNEnterprises. UNKNOWN UNKNOWN works at UNKNOWN and lives in UNKNOWN.'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 24
        }
      ]
    },
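    {
      "cell_type": "markdown",
      "source": [
        "Because `str.replace` works on raw substrings, the function above also turns `StarkEnterprises` into `UNKNOWNEnterprises`. One way to avoid this is to replace by character offsets, working from the end of the text backwards so that earlier offsets stay valid. The cell below is a minimal sketch in plain Python, not a definitive implementation: `mask_spans` is a hypothetical helper, and the spans are built by hand with `str.index` purely for illustration. With spaCy, the offsets could instead be taken from `ent.start_char` and `ent.end_char` of each entity in `doc.ents`."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "def mask_spans(text, spans, placeholder='UNKNOWN'):\n",
        "  # spans: (start_char, end_char, label) tuples; hypothetical input format.\n",
        "  # Replace from right to left so that earlier offsets remain valid.\n",
        "  for start, end, _label in sorted(spans, reverse=True):\n",
        "    text = text[:start] + placeholder + text[end:]\n",
        "  return text\n",
        "\n",
        "sample = ('Tony Stark owns the company StarkEnterprises. '\n",
        "          'Emily Clark works at Microsoft and lives in Manchester.')\n",
        "\n",
        "# Hand-picked phrases standing in for model output (assumption for this sketch)\n",
        "phrases = [('Tony Stark', 'PERSON'), ('Emily Clark', 'PERSON'),\n",
        "           ('Microsoft', 'ORG'), ('Manchester', 'GPE')]\n",
        "spans = [(sample.index(p), sample.index(p) + len(p), label) for p, label in phrases]\n",
        "\n",
        "masked = mask_spans(sample, spans)\n",
        "print(masked)"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },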
    {
      "cell_type": "markdown",
      "source": [
        "## Схожесть текстов"
      ],
      "metadata": {
        "id": "DNajK1S1sk_F"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Recall that every word has its own vector representation. Because vectors are numeric, they can be used as input for various NLP tasks (for example, classification).\n",
        "\n",
        "Words that are close in meaning have similar vector representations."
      ],
      "metadata": {
        "id": "qxYsaltRsn3J"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "*Note: word vectors ship only with the larger models (`_md` and `_lg`). The `_sm` model has only context-sensitive tensors, so similarity scores computed with it are less reliable.*"
      ],
      "metadata": {
        "id": "oqqfPCENul3X"
      }
    },
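    {
      "cell_type": "markdown",
      "source": [
        "To check whether a token has a vector, use the `.has_vector` attribute. With a `_sm` model it can still return `True`, because the model falls back to its context-sensitive tensors."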
      ],
      "metadata": {
        "id": "6tq7KfSZwScm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "tokens = nlp(\"I love dogs.\")\n",
        "for token in tokens:\n",
        "  print(token.text ,' ', token.has_vector)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LieSvbRluDEn",
        "outputId": "28221527-9f5b-49e3-ac6d-0c45953a8b38"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I   True\n",
            "love   True\n",
            "dogs   True\n",
            ".   True\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "To get a token's vector, use the `.vector` attribute."
      ],
      "metadata": {
        "id": "kpi7iYYkwyr6"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "tokens = nlp(\"I love dogs.\")\n",
        "for token in tokens:\n",
        "  print(token.text ,' ', token.vector)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "RSDicLq-wg5R",
        "outputId": "8e8e7556-42b0-4b1c-d6f7-3c8b8ddf6c64"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I   [-1.0430092   2.8545923   2.0600038  -3.4211955   0.99649835 -1.1599871\n",
            "  2.749505    3.9365487   1.5571965   0.6664208   0.59999466  2.9803002\n",
            "  0.16377205  2.9965272   3.3764315  -0.89014435 -3.2101564   2.9714522\n",
            " -2.6696014  -1.6166049  -1.8172715  -2.7395675  -2.9825807  -3.6333983\n",
            " -2.8874328  -2.5601904  -4.4227586  -2.131781   -0.25393736 -0.20052847\n",
            "  1.6852098  -1.1961273  -0.9421311   3.4984024   7.282257   -0.04792026\n",
            " -2.7910743   5.4995027   2.4805403  -0.07623506  1.6081547  -1.8639393\n",
            " -1.5294671  -1.9714432   1.2756953   1.2758039   1.2159228  -1.8443458\n",
            " -1.8838339   1.1741894  -1.9123857  -2.5920343  -0.41342843  3.0118423\n",
            "  0.88132316 -0.07438791  1.0369622  -1.4144522  -0.17261651 -2.0494165\n",
            "  3.4219077  -2.8342834   0.00958729  5.4984064  -1.8489454  -2.511528\n",
            "  1.1873667  -4.1679707  -2.2335994   0.68960917 -1.3361545   1.2226154\n",
            " -0.9109669  -1.7300329   3.4517438  -2.4124196   1.8685429   0.06975681\n",
            "  1.8657271   0.08461511 -3.6619987   2.2404995  -5.398147    4.581538\n",
            "  2.7247477   3.5859838   4.5996675   0.96966726  1.1127226  -2.5290017\n",
            " -0.16042691 -2.6043274  -0.3350316  -1.3092078   0.6103209   0.95865226]\n",
            "love   [-2.0926478   0.8165207   0.6531313  -2.4250379  -3.2491603   2.9092011\n",
            " -3.0111592   1.5985104   1.4735556   3.8672802  -1.5705867  -1.0045079\n",
            "  3.4686258   3.053741    0.0220682   1.6985768  -2.312493   -1.7262487\n",
            "  1.1002957   1.2367346   0.7337955   0.8479775  -1.4502883   0.72715497\n",
            " -0.2981826  -2.580995    1.4671407  -4.2149014  -0.10922135  3.7058742\n",
            "  1.7134359  -1.2543209   1.4770842   0.11844853 -2.8699975  -3.0601144\n",
            " -0.80140805 -0.02256711  2.4317803  -0.82486874 -0.5977093   0.20079572\n",
            "  1.7992224  -2.643847    4.11491    -1.746896   -0.344749    0.57683814\n",
            " -2.8730056  -1.1731097  -2.2473903  -2.7736866   4.278227    1.284664\n",
            " -3.8134208  -2.759683    2.7348762  -1.3746341   0.51310825  8.658856\n",
            "  3.5977132  -2.5552604  -2.2948432  -1.3764751  -3.1880894   1.1603413\n",
            " -1.3831725  -1.4777741   0.93175805 -1.1381366   0.34955418  3.8045206\n",
            "  3.003464   -0.5803115  -0.06304705  0.3200203   1.8878257  -0.660155\n",
            "  0.38454205  0.1265434  -2.371572   -2.8394241  -2.2422209  -0.10410917\n",
            "  2.494587   -0.39664572  0.485955   -1.257124   -3.5717418   2.5171585\n",
            "  2.814128    4.7768383  -4.1206675   2.1601095   0.43720716  1.4954462 ]\n",
            "dogs   [ 2.8393528  -1.1312963  -0.11751944 -0.08010077  2.1061428  -3.1990964\n",
            "  0.923126    0.6573924   3.5902195  -0.29428005  0.4725598   0.17764413\n",
            "  1.9018803   1.297581   -1.6912122  -3.989336   -3.522654   -3.540039\n",
            "  0.46568882 -1.186924    1.1889963  -2.048414   -1.9185276   0.87987673\n",
            " -0.5404664  -0.8146296   0.57326716 -3.6189375  -0.06358629  2.696792\n",
            " -1.4618804  -1.1077768  -1.9731772   0.2689248   0.53459144 -4.3648596\n",
            "  0.91225064  1.7380898  -4.1165037   2.7106423   2.034181    2.861469\n",
            "  1.3985858  -2.6132827   0.59446687  2.8938382   0.6868919   0.8213347\n",
            " -2.5613382   1.2565048  -2.699062    1.1010365  -1.8702949  -1.3454587\n",
            " -0.8252897  -0.572858    0.7562732   0.44618732 -1.4631491  -1.4131639\n",
            " -0.2915418   8.324114   -3.087992   -2.700141    3.9717796   0.5890998\n",
            "  2.8838158  -2.7959366   1.0452635   0.37012506 -0.72057164 -1.0287178\n",
            "  0.21947849  2.5654914   1.3743399  -2.5482686  -1.7977817  -0.8036821\n",
            " -1.0948054   0.29791737 -0.54444504  4.2128034  -1.7309783  -1.3028209\n",
            "  1.4333308  -1.2037553   1.1549511   1.8365583  -0.659603   -0.7345972\n",
            "  1.6074127  -0.78991616  4.931231    1.3063501   0.44686097  7.188323  ]\n",
            ".   [-0.12938994 -0.71469605 -0.3640081   0.4233815   2.3338962  -0.21080992\n",
            "  2.9297519  -3.9241085   1.4564368  -2.7773795  -2.276672   -2.6692314\n",
            " -0.66740763  1.6892216  -3.5042648  -3.670801    3.0736852  -1.3254118\n",
            "  1.559247   -1.040769    2.0081518  -0.8765433   3.9257965   0.6208761\n",
            " -0.49966076 -0.0456332   2.1130955  -0.7641147   0.15661877 -0.11237016\n",
            " -0.26620656 -0.3048855   5.051791    4.483428   -1.7277452  -3.2551239\n",
            "  0.48837227 -1.8382733   0.7902676  -2.4668727  -2.1786537  -0.5026276\n",
            "  2.288679   -2.4495609   1.1811299  -3.07627     1.7830915  -1.9518898\n",
            " -3.4725814   0.5841797   2.2590346   0.7694223  -1.025698   -2.131234\n",
            "  1.8778348  -0.9619459  -0.65143514 -0.16396458 -2.442126    6.361784\n",
            " -1.9778142   4.389342    2.0689464  -1.9854448  -0.24451903 -0.4023331\n",
            " -3.1696262   0.28868896  0.44534492 -0.95210403  0.8234291  -1.7839663\n",
            "  1.1794956  -2.6875343  -1.5854799   1.7984003   0.6545099   4.499108\n",
            "  1.5603552   0.01739711 -2.3020434   0.6972862   4.267438   -0.82414305\n",
            "  0.8551713   2.5179853   0.9457948   1.045276   -2.2220998   0.41401815\n",
            "  1.8998947  -2.6696262  -0.19367981 -0.15285462 -2.028827   -0.6200065 ]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "You can also get the L2 norm of a token's vector with the `.vector_norm` attribute."
      ],
      "metadata": {
        "id": "yQVmLU5S5wd7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for token in tokens:\n",
        "  print(token.text ,' ', token.vector_norm)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YIJBUu2Bw1fz",
        "outputId": "d38b7f8e-0c1b-4ae1-9a2c-0b6d52623366"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "I   24.42788\n",
            "love   22.964434\n",
            "dogs   22.114325\n",
            ".   20.95947\n"
          ]
        }
      ]
    },
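    {
      "cell_type": "markdown",
      "source": [
        "How do we measure the **similarity of two words**? Use the **similarity()** method, which returns the cosine similarity of the two vectors. In the cells below each word is passed through `nlp`, so the comparison is between two one-word `Doc` objects."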
      ],
      "metadata": {
        "id": "7U_S8ixd59EW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "token_1=nlp(\"Bad\")\n",
        "token_2=nlp(\"Terrible\")\n",
        "\n",
        "similarity_score=token_1.similarity(token_2)\n",
        "print(similarity_score)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UJumksqv6JbA",
        "outputId": "76b035b1-beb7-47b3-f0ba-997967c328f6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.6228420741704979\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/lib/python3.7/runpy.py:193: ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.\n",
            "  \"__main__\", mod_spec)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "token_1=nlp(\"Bad\")\n",
        "token_2=nlp(\"Good\")\n",
        "\n",
        "similarity_score=token_1.similarity(token_2)\n",
        "print(similarity_score)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "3GujVROj8cD4",
        "outputId": "090e83f3-76f9-4bb6-eb54-ab7bfc728148"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.612674126957667\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/lib/python3.7/runpy.py:193: ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.\n",
            "  \"__main__\", mod_spec)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "review_1=nlp(' The food was amazing')\n",
        "review_2=nlp('The food was excellent')\n",
        "review_3=nlp('I did not like the food')\n",
        "review_4=nlp('It was very bad experience')\n",
        "\n",
        "score_1=review_1.similarity(review_2)\n",
        "print('Similarity between review 1 and 2',score_1)\n",
        "\n",
        "score_2=review_3.similarity(review_4)\n",
        "print('Similarity between review 3 and 4',score_2)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HULPDmx08jN0",
        "outputId": "41d79921-d8d5-49f6-d91f-e2688f162d83"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Similarity between review 1 and 2 0.8357777268543993\n",
            "Similarity between review 3 and 4 0.3581141289329264\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/lib/python3.7/runpy.py:193: ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.\n",
            "  \"__main__\", mod_spec)\n",
            "/usr/lib/python3.7/runpy.py:193: ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.\n",
            "  \"__main__\", mod_spec)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Note the **ModelsWarning** (W007) message: the model we are using has no word vectors, so the similarity method falls back on other data (the tagger, parser, and NER). To get more meaningful similarity scores, let's load a larger model."
      ],
      "metadata": {
        "id": "RJSjART16QHv"
      }
    },
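    {
      "cell_type": "code",
      "source": [
        "import spacy\n",
        "# Assumes the large model is already installed, e.g. via: python -m spacy download en_core_web_lg\n",
        "nlp = spacy.load(\"en_core_web_lg\")"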
      ],
      "metadata": {
        "id": "L707siBk6kt6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "token_1=nlp(\"Bad\")\n",
        "token_2=nlp(\"Terrible\")\n",
        "\n",
        "similarity_score=token_1.similarity(token_2)\n",
        "print(similarity_score)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vufn4ZcL7me1",
        "outputId": "474d6373-5923-4b6b-988d-bf8226a31b59"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.7739191815858104\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "token_1=nlp(\"Bad\")\n",
        "token_2=nlp(\"Good\")\n",
        "\n",
        "similarity_score=token_1.similarity(token_2)\n",
        "print(similarity_score)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "l_5eY3-k7ti-",
        "outputId": "0ff3d751-a6a6-41f2-c680-93ad765bac2f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.7355090324289566\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "review_1=nlp(' The food was amazing')\n",
        "review_2=nlp('The food was excellent')\n",
        "review_3=nlp('I did not like the food')\n",
        "review_4=nlp('It was very bad experience')\n",
        "\n",
        "score_1=review_1.similarity(review_2)\n",
        "print('Similarity between review 1 and 2',score_1)\n",
        "\n",
        "score_2=review_3.similarity(review_4)\n",
        "print('Similarity between review 3 and 4',score_2)\n",
        "\n",
        "score_3=review_2.similarity(review_4)\n",
        "print('Similarity between review 2 and 4',score_3)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wAsM31iJ7vkn",
        "outputId": "d6656fc5-5a59-426e-98fd-3c6583a57b2a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Similarity between review 1 and 2 0.9566212627033192\n",
            "Similarity between review 3 and 4 0.8461898618188776\n"
          ]
        }
      ]
    }
  ]
}